This notebook relates to the NOAA Weather Dataset - JFK Airport (New York). The dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer Data Asset Exchange.
In this notebook, we clean the raw dataset by restricting it to a subset of numerical, hourly observation columns; handling missing, trace, and erroneous values; resampling to exact hourly measurement intervals; encoding the wind direction and pressure tendency features; and renaming the columns.
Before you run this notebook, complete the following steps:
When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:
1. Click **More -> Insert project token** in the top-right menu section. This should insert a cell at the top of this notebook similar to the example given above.
2. If an error is displayed indicating that no project token is defined, follow these instructions.
3. Run the newly inserted cell before proceeding with the notebook execution below.
Import and configure the required modules.
# Define required imports
import pandas as pd
import numpy as np
import sys
import re
# These set pandas max column and row display in the notebook
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)
We start by reading in the raw dataset, displaying the first few rows of the dataframe, and taking a look at the columns and column types present.
# define filename
DATA_PATH = 'jfk_weather.csv'
def get_file_handle(fname):
# Project data path for the raw data file
data_path = project.get_file(fname)
data_path.seek(0)
return data_path
# Using pandas to read the data
# Since the `DATE` column contains date-time information, we use the pandas parse_dates keyword for easier data processing
data_path = get_file_handle(DATA_PATH)
raw_data = pd.read_csv(data_path, parse_dates=['DATE'])
raw_data.head()
raw_data.dtypes
As you can see above, there are a lot of fields which are non-numerical. Usually these will be fields that contain text or categorical data, e.g. `HOURLYSKYCONDITIONS`. There are also fields, such as the main temperature field of interest `HOURLYDRYBULBTEMPF`, that we expect to be numerical but are instead of `object` type. This often indicates that there may be missing (or `null`) values, or some other unusual readings that we may have to deal with (since otherwise the field would have been fully parsed as a numerical data type).
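As an aside (not part of the cleaning pipeline), one quick way to surface the entries that force a column to `object` dtype is `pd.to_numeric` with `errors='coerce'`; the toy values below are illustrative:

```python
import pandas as pd

# A toy column mixing numeric strings with special codes, as in the raw data
s = pd.Series(['33', '34', 'T', '0.01s'], dtype=object)

# Entries that fail numeric parsing become NaN, which reveals the offenders
coerced = pd.to_numeric(s, errors='coerce')
offenders = s[coerced.isna()]
print(offenders.tolist())  # ['T', '0.01s']
```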
In addition, some fields relate to hourly observations, while others relate to daily or monthly intervals. For purposes of later exploratory data analysis, we will restrict the dataset to a certain subset of numerical fields that relate to hourly observations.
In this section, we refer to the NOAA Local Climatological Data Documentation to describe the fields and meaning of various values.
First, we select only the subset of data columns of interest and inspect the column types.
# Choose what columns to import from raw data
column_subset = [
'DATE',
'HOURLYVISIBILITY',
'HOURLYDRYBULBTEMPF',
'HOURLYWETBULBTEMPF',
'HOURLYDewPointTempF',
'HOURLYRelativeHumidity',
'HOURLYWindSpeed',
'HOURLYWindDirection',
'HOURLYStationPressure',
'HOURLYPressureTendency',
'HOURLYSeaLevelPressure',
'HOURLYPrecip',
'HOURLYAltimeterSetting'
]
# Filter dataset to relevant columns
hourly_data = raw_data[column_subset]
# Set date index
hourly_data = hourly_data.set_index(pd.DatetimeIndex(hourly_data['DATE']))
hourly_data.drop(['DATE'], axis=1, inplace=True)
hourly_data.replace(to_replace='*', value=np.nan, inplace=True)
hourly_data.head()
hourly_data.dtypes
From the dataframe preview above, we can see that the column `HOURLYPrecip`, which is the hourly measure of precipitation levels, contains both `NaN` and `T` values. `T` specifies trace amounts of precipitation, while `NaN` means "not a number" and is used to denote missing values.
We can also inspect the unique values present for the field.
hourly_data['HOURLYPrecip'].unique()
We can see that some values end with an `s` (indicating snow), while there is a strange value `0.020.01s`, which appears to be an error of some sort. To deal with `T` values, we will set the observation to `0`. We will also replace the erroneous value `0.020.01s` with `NaN`.
# Fix imported data
hourly_data['HOURLYPrecip'].replace(to_replace='T', value='0.00', inplace=True)
hourly_data['HOURLYPrecip'].replace('0.020.01s', np.nan, inplace=True)
Next, we will convert string columns that refer to numerical values to numerical types. For columns such as `HOURLYPrecip`, we first also drop the non-numerical parts of the value (the `s` character).
# Set of columns to convert
messy_columns = column_subset[1:]
# Convert columns to float32 datatype
for i in messy_columns:
hourly_data[i] = hourly_data[i].apply(lambda x: re.sub('[^0-9,.-]', '', x) if type(x) == str else x).replace('', np.nan).astype(('float32'))
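To illustrate what the regular expression in the loop above does, here is a small check on made-up sample values (note that in the pipeline, `T` has already been replaced by `0.00` at this point):

```python
import re

import numpy as np
import pandas as pd

samples = pd.Series(['0.03s', '10.00', '-2.5', 'T'])

# Strip any character that is not a digit, comma, dot, or minus sign,
# then turn empty strings into NaN and cast to float32
cleaned = (samples
           .apply(lambda x: re.sub('[^0-9,.-]', '', x) if type(x) == str else x)
           .replace('', np.nan)
           .astype('float32'))
print(cleaned.tolist())
```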
We can now see that all fields have numerical data type.
print(hourly_data.info())
print()
hourly_data.head()
Next, we will clean up some of the data columns to ensure their values fall within the parameters defined by the NOAA documentation (referred to above).
# Generate the summary statistics for each column
hourly_data.describe()
According to the documentation, the `HOURLYPressureTendency` field should be an integer value in the range `[0, 8]`. Let's check if this condition holds for this dataset.
# Check if categorical variable HOURLYPressureTendency ever has a non-integer entry outside the bounds of 0-8
cond = len(hourly_data[~hourly_data['HOURLYPressureTendency'].isin(list(range(0,9)) + [np.nan])])
print('Hourly Pressure Tendency should be between 0 and 8: {}'.format(cond == 0))
`HOURLYVISIBILITY` should be a numeric value in the range `[0, 10]`. Let's check this condition too.
# Hourly Visibility should be between 0 and 10
hourly_data[(hourly_data['HOURLYVISIBILITY'] < 0) | (hourly_data['HOURLYVISIBILITY'] > 10)]
We find that a couple of observations fall outside the range. These must be spurious data observations, and we handle them by replacing them with `NaN`.
# Replace any hourly visibility figure outside these bounds with nan
hourly_data.loc[hourly_data['HOURLYVISIBILITY'] > 10, 'HOURLYVISIBILITY'] = np.nan
# Hourly Visibility should be between 0 and 10
cond = len(hourly_data[(hourly_data['HOURLYVISIBILITY'] < 0) | (hourly_data['HOURLYVISIBILITY'] > 10)])
print('Hourly Visibility should be between 0 and 10: {}'.format(cond == 0))
Finally, we check if there are any duplicates with respect to our `DATE` index, and furthermore that our dates are in the correct order (that is, strictly increasing).
cond = len(hourly_data[hourly_data.index.duplicated()].sort_index())
print('Date index contains no duplicate entries: {}'.format(cond == 0))
# Make sure time index is sorted and increasing
print('Date index is strictly increasing: {}'.format(hourly_data.index.is_monotonic_increasing))
Most time-series analysis requires (or certainly works much better with) data that has fixed measurement intervals. As you may have noticed from the various data samples above, the measurement intervals for this dataset are not exactly hourly. So, we will use pandas' resampling functionality to create a dataset that has exact hourly measurement intervals.
# Resample (downsample) to hourly rows (we're shifting everything up by 9 minutes!)
hourly_data = hourly_data.resample('60min').last().shift(periods=1) #Note: use resample('60min', base=51) to resample on the 51st of every hour
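To make the effect of this resampling concrete, here is a small sketch with made-up timestamps at 51 minutes past the hour; the `.last().shift(periods=1)` combination assigns each bin's reading to the following hour mark:

```python
import pandas as pd

# Toy observations taken at 51 minutes past each hour
idx = pd.to_datetime(['2017-01-01 00:51', '2017-01-01 01:51', '2017-01-01 02:51'])
obs = pd.Series([10.0, 11.0, 12.0], index=idx)

# Bin into hourly intervals, keep the last reading in each bin,
# then shift so the 00:51 reading is reported at the 01:00 mark
hourly = obs.resample('60min').last().shift(periods=1)
print(hourly)
```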
We will now also replace missing values. For numerical values, we will linearly interpolate between the previous and next valid observations. For the categorical `HOURLYPressureTendency` field, we will replace missing values with the last valid observation.
hourly_data['HOURLYPressureTendency'] = hourly_data['HOURLYPressureTendency'].fillna(method='ffill') # fill with last valid observation
hourly_data = hourly_data.interpolate(method='linear') # interpolate missing values
hourly_data.drop(hourly_data.index[0], inplace=True) # drop first row
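A toy comparison of the two fill strategies used above, on illustrative values (`.ffill()` is equivalent to the `fillna(method='ffill')` call in the cell above):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

interpolated = s.interpolate(method='linear')  # fills the gap linearly
forward_filled = s.ffill()                     # repeats the last valid value

print(interpolated.tolist())    # [1.0, 2.0, 3.0, 4.0]
print(forward_filled.tolist())  # [1.0, 1.0, 1.0, 4.0]
```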
print(hourly_data.info())
print()
hourly_data.head()
The final pre-processing step we will perform is to handle two of our columns in a special way in order to correctly encode these features. They are:

- `HOURLYWindDirection` - wind direction
- `HOURLYPressureTendency` - an indicator of pressure changes

For `HOURLYWindDirection`, we encode the raw feature value as two new values, which capture the cyclical nature of wind direction - that is, we are encoding the compass-point nature of wind direction measurements.
# Transform HOURLYWindDirection into a cyclical variable using sin and cos transforms
hourly_data['HOURLYWindDirectionSin'] = np.sin(hourly_data['HOURLYWindDirection']*(2.*np.pi/360))
hourly_data['HOURLYWindDirectionCos'] = np.cos(hourly_data['HOURLYWindDirection']*(2.*np.pi/360))
hourly_data.drop(['HOURLYWindDirection'], axis=1, inplace=True)
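To see why this transform helps, consider that wind directions of 10° and 350° are numerically far apart but close on the compass; in the (sin, cos) plane they become near neighbours. A quick illustrative check (the `encode` helper is hypothetical, mirroring the transform above):

```python
import numpy as np

def encode(deg):
    """Map a compass direction in degrees to a point on the unit circle."""
    rad = deg * (2. * np.pi / 360)
    return np.array([np.sin(rad), np.cos(rad)])

# Distance between nearby compass points (10° vs 350°) is small...
near = np.linalg.norm(encode(10) - encode(350))
# ...while genuinely different directions (10° vs 100°) stay far apart
far = np.linalg.norm(encode(10) - encode(100))
print(near, far)
```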
For `HOURLYPressureTendency`, the feature value is in fact a categorical feature with three levels:

- `0-3` indicates an increase in pressure over the previous 3 hours
- `4` indicates no change during the previous 3 hours
- `5-8` indicates a decrease over the previous 3 hours

Hence, we encode this feature into 3 dummy values representing these 3 potential states.
# Transform HOURLYPressureTendency into 3 dummy variables based on NOAA documentation
hourly_data['HOURLYPressureTendencyIncr'] = [1.0 if x in [0,1,2,3] else 0.0 for x in hourly_data['HOURLYPressureTendency']] # 0 through 3 indicates an increase in pressure over previous 3 hours
hourly_data['HOURLYPressureTendencyDecr'] = [1.0 if x in [5,6,7,8] else 0.0 for x in hourly_data['HOURLYPressureTendency']] # 5 through 8 indicates a decrease over previous 3 hours
hourly_data['HOURLYPressureTendencyConst'] = [1.0 if x == 4 else 0.0 for x in hourly_data['HOURLYPressureTendency']] # 4 indicates no change during previous 3 hours
hourly_data.drop(['HOURLYPressureTendency'], axis=1, inplace=True)
hourly_data['HOURLYPressureTendencyIncr'] = hourly_data['HOURLYPressureTendencyIncr'].astype(('float32'))
hourly_data['HOURLYPressureTendencyDecr'] = hourly_data['HOURLYPressureTendencyDecr'].astype(('float32'))
hourly_data['HOURLYPressureTendencyConst'] = hourly_data['HOURLYPressureTendencyConst'].astype(('float32'))
hourly_data.columns
# define the new column names
columns_new_name = [
'visibility',
'dry_bulb_temp_f',
'wet_bulb_temp_f',
'dew_point_temp_f',
'relative_humidity',
'wind_speed',
'station_pressure',
'sea_level_pressure',
'precip',
'altimeter_setting',
'wind_direction_sin',
'wind_direction_cos',
'pressure_tendency_incr',
'pressure_tendency_decr',
'pressure_tendency_const'
]
columns_name_map = {c:columns_new_name[i] for i, c in enumerate(hourly_data.columns)}
hourly_data_renamed = hourly_data.rename(columns=columns_name_map)
print(hourly_data_renamed.info())
print()
hourly_data_renamed.head()
# Explore some general information about the dataset
print('# of megabytes held by dataframe: ' + str(round(sys.getsizeof(hourly_data_renamed) / 1000000,2)))
print('# of features: ' + str(hourly_data_renamed.shape[1]))
print('# of observations: ' + str(hourly_data_renamed.shape[0]))
print('Start date: ' + str(hourly_data_renamed.index[0]))
print('End date: ' + str(hourly_data_renamed.index[-1]))
print('# of days: ' + str((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days))
print('# of months: ' + str(round((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days/30,2)))
print('# of years: ' + str(round((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days/365,2)))
Finally, we save the cleaned dataset as a Project asset for later re-use. You should see an output like the one below if successful:
{'file_name': 'jfk_weather_cleaned.csv',
'message': 'File saved to project storage.',
'bucket_name': 'jfkweatherdata-donotdelete-pr-...',
'asset_id': '...'}
Note: In order for this step to work, your project token (see the first cell of this notebook) must have the `Editor` role. By default this will overwrite any existing file.
project.save_data("jfk_weather_cleaned.csv", hourly_data_renamed.to_csv(float_format='%g'), overwrite=True)
Next, run the Part 2 - Data Analysis notebook to explore the cleaned dataset.

This notebook was created by the Center for Open-Source Data & AI Technologies.
Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.